Take a look around the integrated development environment (IDE). This is your code writing box, where you can script code. Below (bottom left) you will see your console and terminal. The console is where you run the code and see results. The terminal allows you to move around in your computer. In the top right, focus on the environment tab. This is where variables, values, and loaded functions are stored. Finally, look to the bottom right. Here you can see files on your computer, see plots, see loaded packages, and get help.
We want to reproduce a figure from the Klein paper. The easiest one is Figure 1A. You should have the data and code to reproduce this figure in the /QSCBC/Data/Klein_fig1 directory. Here is a picture of the figure we are trying to reproduce:
One of the first things you should do when you start an R session is to see what is in your environment.
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.1 magrittr_1.5 tools_3.5.1 htmltools_0.3.6
## [5] yaml_2.2.0 Rcpp_1.0.0 stringi_1.3.1 rmarkdown_1.11
## [9] knitr_1.21 stringr_1.4.0 xfun_0.4 digest_0.6.18
## [13] evaluate_0.13
The next step is to load required packages for your analysis. In this case, you will need two packages: (1) Seurat - a package for scRNAseq data analysis; and (2) dPlyr - a package for data manipulation/summarization.
install.packages("Seurat", repos = "http://cran.us.r-project.org")
## also installing the dependencies 'modeltools', 'R.oo', 'R.methodsS3', 'bibtex', 'gbRd', 'lsei', 'segmented', 'mclust', 'flexmix', 'prabclus', 'diptest', 'kernlab', 'trimcluster', 'proxy', 'R.utils', 'hexbin', 'Rdpack', 'npsurv', 'iterators', 'ROCR', 'mixtools', 'lars', 'ica', 'Rtsne', 'fpc', 'pbapply', 'RANN', 'irlba', 'dtw', 'SDMTools', 'plotly', 'metap', 'lmtest', 'fitdistrplus', 'png', 'doSNOW', 'reticulate', 'foreach', 'hdf5r', 'RcppProgress'
##
## There is a binary version available but the source version is
## later:
## binary source needs_compilation
## plotly 4.7.1 4.8.0 FALSE
##
##
## The downloaded binary packages are in
## /var/folders/m1/_hb5y9zd0db1xg59k1pcrls00000gn/T//RtmpzCQBqx/downloaded_packages
## installing the source package 'plotly'
install.packages("plyr", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/m1/_hb5y9zd0db1xg59k1pcrls00000gn/T//RtmpzCQBqx/downloaded_packages
install.packages("dplyr", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/m1/_hb5y9zd0db1xg59k1pcrls00000gn/T//RtmpzCQBqx/downloaded_packages
Next, you need to get into the directory associated with your data. In this case, my data is in the Klein_fig1 folder in the Data folder of the class folder
setwd('~/QSBSC/Data/Klein_fig1/')
Now, we need to read in the data. For single-cell RNA sequencing, this data is a matrix of cells with read counts. This is a large dataset, so it may take a few seconds.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
raw_counts <- read.table(unz("GSM3067189_04hpf_nm.csv.zip", "GSM3067189_04hpf_nm.csv"),
sep = ",", header = TRUE, row.names = 1)
head(raw_counts)
This dataset is large. To compress the data into a readable form by the R package, create a new object. Use the following settings to clean the dataset.
library(Seurat)
## Loading required package: ggplot2
## Loading required package: cowplot
## Warning: package 'cowplot' was built under R version 3.5.2
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
## Loading required package: Matrix
hpf4 <- CreateSeuratObject(raw.data = raw_counts, min.cells = 3,
min.genes = 200, project = "hpf4")
This data came pre-normalized. However, we need to identify features (genes) that define the data, and scale the data so the gene counts are not too low.
library(Seurat)
hpf4 <- FindVariableGenes(hpf4, do.plot = F)
hpf4 <- ScaleData(hpf4, display.progress = T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
Ok, so now we can do some dimensionality reduction and plotting of the data. Principal Component Analysis (PCA) identifies the axes of most variance in the dataset. By projecting the data on the axes with the most variance, one can identify groups that exist in the high dimensional space.
library(Seurat)
hpf4 <- RunPCA(object = hpf4,
pc.genes = hpf4@var.genes,
genes.print = 10)
## [1] "PC1"
## [1] "hnrnpaba" "nnr" "mid1ip1l" "h1m" "snrpd3l"
## [6] "hist1h4l" "zgc:153409" "snrpa" "nucks1a" "mat2aa"
## [1] ""
## [1] "zgc:56676" "rpl11" "zgc:114130" "ccne1"
## [5] "rps7" "LOC563933" "tnpo3" "rps11"
## [9] "magoh" "LOC103911445"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "cldng" "cldnd" "btg4" "LOC100332229"
## [5] "cth1" "mid1ip1l" "org" "acp5a"
## [9] "thy1" "siva1"
## [1] ""
## [1] "srsf5a" "LOC100334652" "stm" "akap12b"
## [5] "LOC557268" "tfpi2" "nbas" "lrwd1"
## [9] "zgc:92040" "isoc1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "pho" "mki67" "hrasls"
## [4] "ankrd11" "tfpi2" "hhipl1"
## [7] "si:ch211-116m6.3" "nop56" "acin1a"
## [10] "hist1h4l"
## [1] ""
## [1] "nbas" "rps11" "snrpc"
## [4] "si:dkey-159f12.2" "ccna2" "vgll4l"
## [7] "smad5" "sb:cb288" "gng5"
## [10] "rpl11"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "uacab" "cth1" "acp5a" "ctnnb1" "exd2"
## [6] "epcam" "zgc:110796" "iqgap1" "pi4kab" "zgc:92040"
## [1] ""
## [1] "rps25" "rps12" "rpl14" "rpl23a" "rps14" "rpl26" "rps9"
## [8] "rps19" "rplp0" "rps17"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "znf1137" "zgc:174275" "LOC100536441" "LOC796940"
## [5] "LOC103910357" "zgc:109934" "zic2b" "fam212aa"
## [9] "snai2" "zgc:174624"
## [1] ""
## [1] "cldne" "cebpb" "grhl3" "stard10" "lye" "gata3"
## [7] "krt8" "ftr82" "neurl1ab" "fzd7a"
## [1] ""
## [1] ""
PCAPlot(object = hpf4)
Another dimensionality reduction technique used in single-cell data analysis is t-distributed Stochastic Neighbor Embedding (tSNE). This is a non-linear reduction approach and can emphasis local and global similarities better than PCA.
library(Seurat)
# tSNE is run on the first 10 principal components (for speed)
hpf4 <- RunTSNE(object = hpf4,
dims.use = 1:10,
do.fast = TRUE)
# Color both replicates the same color
colors <- c("DEW050"="grey", "DEW051"="grey")
TSNEPlot(object = hpf4) +
scale_color_manual("4hpf", values = colors)
So, now we have some semantics. How do we plot data exactly as it appears in the paper? The short answer is that with a tool like tSNE, it is almost impossible. Because the process is stochastic, every run of tSNE can be different. However, the authors do provide information on how each cell was clustered. So, we will overlay that data on the tSNE plot.
library(Seurat)
labels <- read.csv(file = "GSM3067189_04hpf_clustID.txt", header = F,
row.names = row.names(hpf4@meta.data))
colnames(labels) <- "cluster"
hpf4 <- AddMetaData(hpf4, labels, col.name = 'cluster')
hpf4 <- SetAllIdent(object = hpf4, id = 'cluster')
TSNEPlot(object = hpf4)
Ok, starting to look better. Now, we need to change those cluster identifiers to biological identifiers. The authors provide a file for that as well. We will intersect those two pieces of information, and overlay the biological identifiers on the plot.
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(dplyr)
library(Seurat)
names <- read.csv(file = "GSE112294_ClusterNames.csv", header=T)
names_hpf4 <- names[1:4,]
names_hpf4$cluster <- c("Epidermal", "Mesoderm", "Epiblast", "Germline")
names_hpf4_sub <- names_hpf4[c("ClusterID", "cluster")]
colnames(names_hpf4_sub) <- c("cluster", "name")
test_join <- join(hpf4@meta.data, names_hpf4_sub, by = "cluster")
rownames(test_join) <- row.names(hpf4@meta.data)
cluster_names <- test_join["name"]
hpf4 <- AddMetaData(hpf4, cluster_names, col.name = 'name')
hpf4 <- SetAllIdent(object = hpf4, id = 'name')
TSNEPlot(object = hpf4)
Finally, we want to match the colors and add some plot annotations (experiment name and number of cells). Here, the colors and text are added to the plot.
library(Seurat)
TSNEPlot(object = hpf4) + ylim(-35,35) + xlim(-40,35) +
geom_text(x=-25, y=-32, label=sprintf("%d cells", nrow(hpf4@meta.data))) +
geom_text(x=-25, y=32, label="4hpf") +
scale_color_manual(labels=c("Epiblast","Epidermal","Germline","Mesoderm"),
values=c("grey50", "darkgreen", "purple", "red"))
## Warning: Removed 152 rows containing missing values (geom_point).